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1 Abstract 

In this paper, we consider the problem of estimating multiple graphical models simultaneously 
£N| ' using the fused lasso penalty, which encourages adjacent graphs to share similar structures. A mo- 

tivating example is the analysis of brain networks of Alzheimer's disease using neuroimaging data. 

■ Specifically, we may wish to estimate a brain network for the normal controls (NC), a brain network 
' for the patients with mild cognitive impairment (MCI), and a brain network for Alzheimer's patients 

(AD). We expect the two brain networks for NC and MCI to share common structures but not to be 
identical to each other; similarly for the two brain networks for MCI and AD. The proposed formu- 
lation can be solved using a blockwise coordinate descent method. Our key technical contribution 
is to establish the necessary and sufficient condition for the graphs to be decomposable. Based on 
this key property, a simple screening rule is presented, which decomposes the large graphs into small 
^ | subgraphs and allows an efficient estimation of multiple independent (small) subgraphs, dramatically 

reducing the computational cost. We perform experiments on both synthetic and real data; our 
results demonstrate the effectiveness and efficiency of the proposed approach. 

1 Introduction 

>; 

Undirected graphical models explore the relationships among a set of random variables through their joint 
distribution. The estimation of undirected graphical models has applications in many domains, such as 

■ computer vision, biology, and medicine. An instance is the analysis of gene expression data. As shown 
in many biological studies, genes tend to work in groups based on their biological functions, and there 
exist some regulatory relationships between genes pQ. Such biological knowledge can be represented as 

| a graph, where nodes are the genes, and edges describe the regulatory relationships. Graphical models 

provide a useful tool for modeling these relationships, and can be used to explore gene activities. One 
of the most widely used graphical models is the Gaussian graphical model (GGM), which assumes the 
variables to be Gaussian distributed [2]. In the framework of GGM, the problem of learning a graph 
is equivalent to estimating the inverse of the covariance matrix (precision matrix), since the nonzero 
off-diagonal elements of the precision matrix represent edges in the graph [2]. 

In recent years many research efforts have focused on estimating the precision matrix and the corre- 
sponding graphical model. Meinshausen and Biihlmann [3] estimated edges for each node in the graph 
by fitting a lasso problem [3] using the remaining variables as predictors. Yuan and Lin [5] and Banerjee 
et al. 2 proposed a penalized maximum likelihood approach using l\ regularization, and used interior 
point optimization to estimate the sparse precision matrix. Friedman et al. [B] introduced a blockwise 
coordinate descent method to solve the same problem, referred to as Graphical lasso (GLasso). Huang 
et al. 17] derived the monotone property of GLasso. Liu et al. [5] introduced a stability-based method for 
choosing the regularization parameters for GLasso. Although GLasso is faster than previous approaches, 
it usually fails to converge with warm-starts. To resolve this issue, Mazumder and Hastie [9] proposed 
a new algorithm called DP-GLasso, each step of which is a box-constrained QP problem. The main 
challenge of estimating a sparse precision matrix is its intensive computation. Witten et al. [TU] and 
Mazumder and Hastie independently derived a simple screening rule, achieving great computational 
gain when the regularization parameter is large. However, these formulations assume that observations 
are independently drawn from a single Gaussian distribution. In many applications the observations 
may be drawn from multiple Gaussian distributions; in this case, multiple graphical models need to be 
estimated. 
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There is some recent work on the estimation of multiple precision matrices. Guo et al. [12] proposed a 
method to jointly estimate multiple graphical models using a hierarchical penalty. However, this method 
is not convex. Danaher et al. |13j estimated multiple precision matrices simultaneously using a pairwise 
fused penalty and grouping penalty. The generalized gradient method was used to solve the problem, but 
it required computing the inverse of precision matrices and checking the positive definiteness of precision 
matrices at each iteration. A screening rule for the two graph case was also proposed in |13j . However, 
it is not clear whether the screening rule can be extended to the more general case with more than two 
graphs, which is the case in brain network modeling. Time-varying graphical models were also studied 
by Zhu et al. Q3], and Kolar et al. ^51 H5] . 

In this paper, we consider the problem of estimating multiple graphical models by maximizing a 
penalized log likelihood with i\ and fused regularization as in |13] . The 1% regularization yields a sparse 
solution, and the fused regularization encourages adjacent graphs to be similar. A motivating example is 
the modeling of brain networks for Alzheimer's disease using neuroimaging data such as Positron emission 
tomography (PET). In this case, we want to estimate graphical models for three groups: normal controls 
(NC), patients of mild cognitive impairment (MCI), and Alzheimer's patients (AD). These networks are 
expected to share some common connections, but they are not identical. Furthermore, the networks are 
expected to evolve over time, in the order of disease progression from NC to MCI to AD. Estimating 
the graphical models separately fails to exploit the common structures among them. It is thus desirable 
to jointly estimate the three networks (graphs) . We employ the blockwise coordinate descent method to 
solve the fused multiple graphical lasso (FMGL), where each step is solved by the accelerated gradient 
method [T7]. Our key technical contribution is to establish the necessary and sufficient condition for the 
FMGL solution to be block diagonal. Based on this key property of FMGL, we develop a screening rule 
which enables the efficient estimation of large multiple precision matrices. Specifically, we derive a set 
of necessary conditions for the solution of FMGL to be block diagonal. We prove that these conditions 
are sufficient when K < 3 (K is the number of graphs to be estimated). Our simulation studies strongly 
indicate that these conditions are also sufficient for any K > 3 as well. Our results significantly extend 
the recent work presented in [13] ■ We conduct experiments on both synthetic and real data; our results 
demonstrate the effectiveness and efficiency of the proposed approach. 

The rest of the paper is organized as follows. We introduce the fused multiple graphical lasso formu- 
lation in Section O The screening rule is presented in Section [3] The experimental results are shown in 
Section 2] We conclude the paper in Section [5] 



2 Fused multiple graphical lasso 

Assume we are given K data sets, X( fe ) 6 lZ nkXp , k — 1, . . . ,K with K > 2, where rik is the number 
of samples, and p is the number of features. The p features are common for all K data sets, and all 
SfcLi n k samples are independent. Furthermore, the samples within each data set X( fc ) are identically 
distributed with a p-variate Gaussian distribution with zero mean and positive definite covariance matrix 
and there are many conditionally independent pairs of features, i.e. the precision matrix 0( fc ) = 
(EW)- 1 should be sparse. For notational simplicity, we assume that n\ = ■ ■ ■ = uk = n. Denote 
the sample covariance matrix for each data set X' fe ) as such that S^ fc ' = ^(XW) T XW, and = 
{©W, . . . , 0( A ')}. Then the negative log likelihood for the data takes the form of 

K 

£ ( ) = J2 (-logdet(0W)+ir(S( fc )©W)) . (1) 

k=l 

Minimizing Eq.([T]) leads to the maximum likelihood estimate (MLE) ©W = (S^) -1 . However, MLE 
fails in the high-dimensional setting. In this setting, the sample size n is less than the number of features 
p, thus is singular. Furthermore, the MLE is unlikely to be sparse. The l\ regularization has been 
employed to induce sparsity, resulting in the sparse inverse covariance estimation [2M5JIS]. In this paper, 
we employ both the t% regularization and the fused regularization for simultaneously estimating multiple 
graphs. The t\ regularization leads to a sparse solution, and the fused penalty encourages ©W to be 
similar to its neighbors. Mathematically, we solve the following formulation: 

K 

min V f-logdei(0 (fe) ) + tr{S {k) & {k) )) +P(@), (2) 

©c=)^o,fc=i...xf— ; v / 
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where 

p(o) = a x x: E i$ J i+^EE i<f - i . 

k=l i^tj fc=l i^j" 

Ai and A2 are nonnegative regularization parameters. The algorithm for solving Eq. ([2]) and the corre- 
sponding complexity analysis are given in Appendix A. 



3 The screening rule for fused multiple graphical lasso 

Witten et al. [10] and Mazumder and Hastie [IT] independently derived a necessary and sufficient condition 
for the solution of a single graphical lasso to be block diagonal. A simple screening test can be used to 
identify the blocks, thus the original graphical lasso problem can be decomposed into several smaller 
problems. When the number of blocks is large, it can achieve massive computational gain. Danaher et 
al. [T3] developed a similar necessary and sufficient condition for fused graphical lasso with two graphs. 
However, it remains a challenge to derive the necessary and sufficient condition for the solution of fused 
multiple graphical lasso to be block diagonal for K > 2 graphs. 

In this section, we first present a theorem demonstrating that FMGL can be decomposable once its 
solution is block diagonal. Then we derive a set of necessary conditions for the solution of FMGL to be 
block diagonal. We also prove that these conditions are sufficient when K = 3. We conjecture that these 
conditions are sufficient for any K > 3 as well (see the discussion in Section l4.2.2p . 

Let C\, . . . , Cl be a partition of the p features into L nonoverlapping sets, with C/ R Cy = 0, VZ ^ V 
and U^=i @l — {!>••• Then we have the following result [IB] : 

Theorem 1. Suppose that the FMGL solution is block diagonal with L known blocks only consisting 
of features in the set Cj, I = 1, . . . , L, i.e. each estimation precision matrix takes the form 

©«= ■.. ,k = l,...,K, 

{ 

then Eq. |2]) can be solved by applying FMGL on just the corresponding set of features: 

K 

&i =argminE(-log^(0 i (fc) ) + ir(S| fc) 0[ fe) )) + P(®i), I = 1, . . . ,L, 

' k=l 

where and Sj are the corresponding |C/| x \Ci\ symmetric submatrices of 0( fe ) and S( fc ) . 

Theorem □ can be directly derived from Eq.©, since det(&^) = nf=i ^(0; (fc) ), tr(S^&^) = 

Ya=i * r (Sj ©j )j and P(0) = J2f=i P(®i)- The key is how to efficiently identify the block structures. 
We address this problem in the remaining part of this section. 

Theorem 2. The following set of conditions are necessary for the FMGL solution &^ , k = 1, . . . , K to 
be block diagonal with L known blocks Ci,l = 1, . . . , L: 

t 

lE s S } l < iA i+ A 2, l<t<K-l, 

k=l 
ta 

I E S S° I < (*2 - ti + l)Ai + 2A 2 , 2 < h < t 2 < K - 1, 

(3) 

< (K-t + l)X 1 + X 2 , 2<t<K, 



k=t 
K 



k=l 



forieCuj eQ,,l^l'. 
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Proof. Denote the inverse of ©w as W^ fc ', for k = l,...,K. For the diagonal elements of ©w, the 
Karush-Kuhn- Tucker (KKT) optimality conditions [TH] for Eq.@ are 

-4 fe) + s u ] =0,l<k<K. 

Thus, the optimal u>£- can be directly computed as s^. For the off-diagonal elements of ®( k \ the KKT 
conditions for Eq.@ can be written as 

- u#> + 4 fc) + A l7 g° + X 2 (-v^ k) + v^ k+1) ) = 0,for 2 < k < K 1 (4) 

- U?y + + Al7y - A 2 Vy - U, 





H Ai 7 g ) + 


\ (1,2) 


= 




(fc) 




■A 2 (-«g 


-M) +w (fe, fe+ i )) 


= 0, for 2 < fc < K - 1 


(K) 


+ Ai7g° 


- A 2 ^ 


-^ = 0. 





where 7 *f is the subgradient of \6£>\: j£> = 1 if 9% } > 0, = -1 if 6^> < 0, and 7 ^ e [-1, 1] if 

eff = 0; t>J' fc+1) is the subgradient of |flg° - 6§ +1) | with respect to flg : v (k f k+l) = 1 if flg° > 6 {k+1) , 

^ k+1) = -1 if 0<*> < and vg> k+1) 6 [-1, 1] if 0<*> = 

Note that W( fc ) has the same block diagonal structure as ®( k \ thus fly = u)y = for i e C/,j 6 
C;/,i 7^ Z'. Then Eq.(d]) can be rewritten as 

= 

(5) 



As a result, we have J2k=i s ij = _A i 2k=i 7i.f ~ A 2 u-j' t+1) , for 1 < t < K - 1, implying | £)fc=i s^)] < 
tAi + A 2 , for 1 < t < K — 1 since 7 y , . . . , 7 ^"\ w-j' 2 ', . . . , Uy ~ 1,K ^ g [— 1 ; 1]. Similarly, we can prove the 
other three conditions. □ 

Next, we show that the conditions ([3]) in Theorem 2 are also sufficient for K = 2,3. Danaher et al. [13] 
have proved the sufficiency when K = 2. We here give a more concise and simpler proof for K = 2. More 
importantly, our proof can be easily extended to the case when K = 3. Before proving the sufficiency, 
we first prove the following lemmas: 

Lemma 1. Suppose \a + (3\ < (ti + l)Xi, \a\ < Ai + i 2 A 2 , and |/3| < tiXi + t 2 X 2 with fi,t 2 > 0, the 
following three intervals intersect: (1) \a\ < t 2 X 2 ; (2) — Ai+a < a < Xi+a; (3) — t\Xi— (3 < a < tiX%—f3. 

Proof. We first prove by contradiction that the first and second intervals intersect. If they do not intersect, 
we must have \a\ > Ai+i 2 A 2 , which contradicts with the condition |a| < Ai+i 2 A 2 . Similarly, the first and 
third intervals intersect. Since \a + j3\ < (t\ + l)Ai, we have Xi+a > — t±Xi — j3 and tiXi — (3 > —X\ + a, 
indicating that the second and third intervals intersect. Thus, the three intervals intersect. □ 

Lemma 2. Suppose \yi + y 2 + 2/3 1 < 3Ai. Then the following region 

max{-Ai - -2Ai + y 2 + 2/3} < »i < min{Ai - yi, 2Ai + y 2 + 2/3}, 
max{-Ai + 2/3, -2Ai -y\- y 2 } < a 2 < min{Ai + y 3 , 2Ai -y\- y 2 } 

is a square (a single point is considered as a special case of a square). Let B and C be the lower-left and 
upper-right vertices of this square, then the diagonal BC belongs to the region 

max{-Ai + y 2 , -2Ai - y 1 - y 3 } < a x - a 2 < min{Ai + y 2 , 2Ai - yi - y 3 }. (7) 

The proof is given in Appendix B. 

Lemma 3. Under the conditions ^ and K = 2,3, the linear system in Eq.§5$ has a solution such that 

(1) (K) (1,2) (K-1,K) r_ 1 
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Proof. Eq.© can be rewritten as a matrix form Ax + y = 0, where A = [A 2 Hi^, XiT-kxk] , x = 



,(1,2) 



,v. 



(K-1,K) (1) 



.7, 



, y 



-,(1) 



,s 



Wit 



and 



/ 1 



H 



if — 



-1 1 
-1 



V 



/ Kx(K-l) 



Note that one solution of Ax + y = is x* = [0i x (k-i)i — y T /Ai] T , an d the null space of A is 



B = Null(A) 



L (.K"-l)x(K-l) 



thus the general solution of Ax + y = takes the form of x — Ba + x* with an arbitrary a g TZ 1 . 
Next we need to show that there exists a solution x such that HxH^ < 1 under the conditions © when 
K = 2,3. For notational simplicity, we use yt to represent in the proof. 
K —2: The solution x takes the form of 



a 

— A 2 a-yi 
x Al 
Ai 



|x||oo < 1 can be expressed as \a\ < 1, 



< a < and ~ A ^+^ 2 < a < ^+2^. If these three 



intervals intersect, there exists x such that |jjfc||oo < 1< The problem of finding the desired x is therefore 
transformed to identifying the intersection of the above set of conditions. According to Lemma [TJ these 
three intervals intersect. 

K—3: Following the same idea, we can obtain the conditions for K = 3: 



and 



lail < 1, \a 2 \ < 1, 



Ai + yi Ai - vx 



A 2 

-Ai + y 2 
A 2 

-Ai + y 3 



< CL\ — a 2 < 

< a 2 < 



A 2 

Ai + y 2 
A 2 

Ai + 2/3 



A 2 A 2 
Consider the summation of Eqs. ([TU)) and (jlip . we can obtain 

-2Ai + y 2 + V3 . , 2Ai + y 2 + y 3 
— < ai < - 



A 2 A 2 

The refined feasible region of a\ by combining Eqs.Q with (fT2")) is given by 

max{-Ai - yi, -2Ai + y 2 + y 3 } . . min{Ai - yi,2Ai + y 2 + y 3 } 
< ai < 



A 2 



A, 



(8) 

(9) 
(10) 

(11) 
(12) 

(13) 



Based on \yi\ < Ai + A 2 , \y 2 + 2/3 1 < 2Ai + A 2 , \y% + y 2 + 2/3 1 < 3Ai as well as Lemma[lJ we can show that 
Eqs.©, ([12"]). and |ai| < 1 intersect. Thus, Eq. (IT3|) also intersects with |ai| < 1. 

Similarly, we can also obtain the refined feasible region of a 2 that intersects with \a 2 \ < 1 



max{-Ai + y 3 , -2Ai - y x - y 2 } . . min{Ai + y 3 , 2A X - y x - y 2 } 
< a 2 < . 



A 2 

and the refined feasible region of a% — a 2 , 

max{-Ai + y 2 , -2Ai - yi - y 3 } 
A 2 



< a\ — a 2 < 



min{Ai + y 2 , 2Ai - yi - y 3 } 



(14) 



(15) 



5 




which intersects with the region of |ai| < 1, |a2 < 1. 

Let A represent the feasible region of Eq. ((5J), B be the feasible region of Eas. (fT5)) and (fLl)) . and C be 
the feasible region of Eq. flTSl) (see Figured] for illustration). We have shown that AdB ^ and ADC ^ 0. 
Next, we will show A n B n C ^ 0. 

According to Lemma [21 £> is a square. Let £? and C be the lower-left and upper-right vertices of B, 
then the diagonal BC of 6 belongs to C. Denote a* as the closest point in A to the diagonal BC. a* 
belongs to B, since AnB ^ 0. We can easily prove that a* e AdBriC. If a* is in the diagonal BC, then 



a* G C (see Figure 1(a)). If not, we prove it by contradiction. Suppose a* ^ C, there exists one closest 



point a G A (~l C to the diagonal BC. The distance from a to the diagonal must be less than that of a* 



which contradicts with the condition that a* is the closest point to the diagonal BC (see Figure 1 (b) ) 



Thus, we have proved that A D B PI C ^ 0, and there exists a solution x such that Ixjloo < 1. □ 

Now we are ready to prove the sufficiency, stated in the following theorem: 

Theorem 3. The conditions ^ are sufficient for the FMGL solution & k \k = 1,...,K to be block 
diagonal with L known blocks Ci,l = 1, . . . , L when K = 2, 3. 

Proof. We construct matrices 0( fe ) with the block diagonal structure Ci, I = 1, . . . , L based on the condi- 
tions ([3]), and show that they are a solution to FMGL. The Z-th block diagonal elements 0; , k = 1, . . . , K 
are obtained by performing FMGL on Sj . According to Theorem [TJ the solution 0p , k = 1,...,K 
satisfies the KKT condition. According to Lemma [3l the linear system in Eq. ([5]) has a solution x such 
that Hxlloo < 1 for i G C\, j £ C//, I ^ V under the conditions ©• Obviously, the choice = §ff — 

for i € Ci, j G CV , / 7^ /' satisfies the KKT condition. Hence, & k \k = 1,...,K form a solution to 
FMGL. □ 

According to Theorem[2]and Theorem[3l the conditions ([3]) can be used as screening rule to identify the 
block structure of the FMGL solution. The steps about how to use the conditions Q are standard [TUlIll) : 

1. Construct an adjacency matrix E = I pxp - Set Eij = Eji = if s\j,k = l,...,K satisfy the 
conditions ©. Otherwise, set J3y = Eji = 1. 

2. Identify the connected components of E. Note that the connected components are the partition of 
the p features into nonoverlapping sets. 

An obvious consequence of Theorem [5] and Theorem [3] is that the off-diagonal elements in the z-th 
row and column are zeros, if k — 1, 2, 3 satisfy the conditions ([3]). In addition, 9^ and w^f 1 can be 
directly computed as and . 



4 Experimental results 

In this section, we evaluate the proposed algorithm and screening rule on synthetic datasets and two 
real datasets: ADHD-200 [H] and FDG-PET images [3D]. The experiments are performed on a PC with 
dual-core Intel 3.0GHz CPU and 4GB memory. The code is written in C. 



G 



4.1 Convergence 



To examine the convergence rate of the proposed method, we randomly generate 3 graphs with 10% 
sparsity and p = 500. The matrices S^ k \k = 1,2,3 are constructed in the following way: the diago- 
nal elements of are set to 0.65, the off-diagonal elements are selectively set to 0.65 based on the 
corresponding graphs, and the remaining ones are set to 0. A noise term 0.35((2^) T <2^ is added to 
S^ k \ where Qw is a p x p matrix with standard Gaussian distribution. The sample covariance matrix 
g(fc) _ g(fc) _|_ Q.35(Q( k )) T Q( k ) . Q( k ) is standardized so that the diagonal elements of S^ fc) are 1. Ai varies 
from 0.25 to 0.45 with a step size of 0.1, and A2 is set to 0.1. From Figure^ we can observe that FMGL 
usually converges within 20 rounds (update all p columns/rows). When the regularization parameter 
value is small, FMGL needs more rounds to converge. 
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Figure 2: Convergence curves of FMGL. Ai is set to 0.25,0.35,0.45, and A2 is set to 0.1. 



4.2 Simulation 
4.2.1 Screening rule 

The synthetic covariance matrices are generated as follows. A block diagonal matrix S with L blocks is 
created, and each block is of size (p/L) x (p/L) with all ones. The sample covariance matrices are generated 
as S (fc ) = 0.5S + 0.5(Q^) T Q^ k \ where is a p x p matrix with standard Gaussian distribution. To 
make sure that the solution has L blocks, we vary Ai from 0.2 to 0.5 with a step size of 0.05, and fix 
A2 to 0.2. The convergence criterion is set to le-5, and the maximal iteration number is set to 1000. 
We use the speedup rate t /t s and the error \f a — f s \ to measure the performance of the screening rule, 
where t , t s are the computational times without and with the screening rule, and f Q , f s are the objective 
values without and with the screening rule. The results with varying p are shown in Figure [3] As shown 
in Figure [31 the screening rule can achieve great computational gain. Since the complexity of FMGL is 
0(Kp 3 ), the speedup rate for a FMGL solution with L same size blocks is 0(L 2 ). It can be observed 
that the speedup rate varies from 10 to 55 for L = 5, and from 25 to 320 for L = 10. We can also find 
that the speedup rate decreases when sparsity increases, and increases when p increases. 




Figure 3: The speedup rate and the error without and with the screening rule. 
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4.2.2 Sufficiency when K > 3 

We conduct simulations to examine whether the conditions (0) can guarantee that there exists a solution 
to the constrained linear system Ax + y = 0, s.t. \\x\\oo < 1 when K > 3. The first and last elements 
of y are uniformly drawn from [— Aj — A2, Ai + A2], and the remaining ones are uniformly drawn from 
[— Ai — 2A2, Ai + 2A2]. If the error ||Ax + y|| 2 >le-8, a counterexample is found. We perform 2 million 
replications for K = 3, 4, ... , 10, 15, 20 respectively. About half of the replications give a y satisfying the 
conditions ([3]). For these y, we minimize ||Ax + y 1 1 2 subject to the constraint using the gradient method. 
No counterexample is found, implying that the conditions © are likely sufficient for any K > 3. 



4.2.3 Stability 



We conduct experiments to demonstrate the effectiveness of FMGL. The synthetic sparse precision matri- 
ces are generated in the following way: we set the first precision matrix as 0.25I pxp , where p = 100. 

When adding an edge (i, j) in the graph, we add a to 9^ and 9^ 



j :i 



and subtract a from 9i) and 9^) 



to keep the positive definiteness of 0W, where a is uniformly drawn from [0.1,0.3]. When deleting an 
edge from the graph, we reverse the above steps with a = 9^ . We randomly assign 200 edges for 
0(!) 0(2) ig obtained by adding 25 edges and deleting 25 different edges from 0( 3 ) is obtained 

from 0( 2 ' in the same way. For each precision matrix, we randomly draw n samples from the Gaussian 
distribution with the corresponding precision matrix, where n varies from 40 to 200 with a step of 20. 
We perform 500 replications for each n. For each n, A2 is fixed to 0.08, and Ai is adjusted to make sure 
that the edge number is about 200. The accuracy nd/n g is used to measure the performance of FMGL 
and GLasso, where rid is the number of true edges detected by FGML and GLasso, and n g is the number 
of true edges. The results are shown in Figure |U We can see from the figure that FMGL achieves higher 
accuracies, demonstrating the effectiveness of FMGL for learning multiple graphical models. 




100 140 
Sample Size 



100 140 
Sample Size 



100 140 
Sample Size 



Figure 4: Comparison of FMGL and GLasso in detecting true edges. 



4.3 Real data 
4.3.1 ADHD-200 

The Attention Deficit Hyperactivity Disorder (ADHD) affects at least 5-10% of school-age children with 
annual costs exceeding 36 billion/year in the United States. The ADHD-200 project has released resting- 
state functional magnetic resonance images (fMRI) of 491 typically developing children and 285 ADHD 
children, aiming to encourage the research on ADHD. The data used in this experiment is the preprocessed 
data using the NIAK pipeline downloaded from neurobureau [21]. More details about the preprocessing 
strategy can be found in the same website. The dataset we choose includes 116 typically developing 
children (TDC), 29 ADHD-Combined (ADHD-C), and 49 ADHD-Inattentive (ADHD-I). There are 231 
time series and 2834 brain regions for each subject. We want to estimate the graphs of the three groups 
simultaneously. The sample covariance matrix is computed using all data from the same group. Since 
the number of brain regions p is 2834, obtaining the precision matrices is computationally intensive. We 
use this data to test the effectiveness of the proposed screening rule. Ai and A2 are set to 0.6 and 0.015. 
The convergence criterion is le-5. The computational time is about 4.13 hours without screening, and 
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172 seconds with screening, demonstrating the superiority of the screening rule. The obtained solution 
has 1443 blocks. The largest one including 634 features is given in Appendix C. 

The block structures of the FMGL solution are the same as those identified by the screening rule. The 
screening rule can be used to analyze the rough structures of the graphs. The cost of identifying blocks 
using the screening rule is negligible compared to that of estimating the graphs. For high-dimensional 
data such as ADHD-200, it is practical to use the screening rule to identify the block structure before 
estimating the large graphs. We use the screening rule to identify block structures on ADHD-200 data 
with varying Ai and A2. The size distribution is shown in Figure [5] We can observe that the number of 
blocks increases, and the size of blocks deceases when the regularization parameter value increases. 
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Figure 5: The size distribution of blocks (in logarithmic scale) identified by the proposed screening rule. 
The color represents the number of blocks of a specified size, (a): Ai varies from 0.5 to 0.95 with A2 fixed 
to 0.015. (b): A 2 varies from to 0.2 with Ai fixed to 0.55. 



4.3.2 FDG-PET 

In this experiment, we use FDG-PET images from 74 Alzheimer's disease (AD), 172 mild cognitive 
impairment (MCI), and 81 normal control (NC) subjects downloaded from the Alzheimer's disease neu- 
roimaging initiative ( ADNI) database. The different regions of the whole brain volume can be represented 
by 116 anatomical volumes of interest (AVOI), defined by Automated Anatomical Labeling (AAL) |22) . 
Then we extracted data from each of the 116 AVOIs, and derived the average of each AVOI for each 
subject. The 116 AVOIs can be categorized into 10 groups: prefrontal lobe, other parts of the frontal lobe, 
parietal lobe, occipital lobe, thalamus, insula, temporal lobe, corpus striatum, cerebellum, and vermis. 
More details about the categories can be found in [231 HI] • We remove two small groups (thalamus and 
insula) containing only 4 AVOIs in our experiments. 

To examine whether FMGL can effectively utilize the information of common structures, we randomly 
select g percent samples from each group, where g varies from 20 to 100 with a step size of 10. For each 
g, A2 is fixed to 0.1, and Ai is adjusted to make sure the number of edges in each group is about the 
same. We perform 500 replications for each g. The edges with probability larger than 0.85 are considered 
as stable edges. The results showing the numbers of stable edges are summarized in Figure |5] We can 
observe that FMGL is more stable than GLasso. When the sample size is too small (say 20%), there are 
only 20 stable edges in the graph of NC obtained by GLasso. But the graph of NC obtained by FMGL 
still has about 140 edges, illustrating the superiority of FMGL in stability. 

The brain connectivity models obtained by FMGL are shown in Figure [71 We can see that the number 
of connections within the prefrontal lobe significantly increases, and the number of connections within the 
temporal lobe significantly decreases from NC to AD, which are supported by previous literatures [241125) . 
The connections between the prefrontal and occipital lobes increase from NC to AD, and connections 
within cerebellum decrease. We can also find that the adjacent graphs are similar, indicating that FMGL 
can identify the common structures, but also keep the meaningful differences. 



5 Conclusion 

In this paper, we consider simultaneously estimating multiple graphical models by maximizing a fused 
penalized log likelihood. The blockwise coordinate descent method is employed to solve the fused multiple 
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Figure 6: The number of stable edges in NC, MCI, and AD. 
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Figure 7: Brain connection models with 265 edges: NC, MCI, and AD. In each figure, the diagonal 
blocks are prefrontal lobe, other parts of frontal lobe, parietal lobe, occipital lobe, temporal lobe, corpus 
striatum, cerebellum, and vermis respectively. 

graphical lasso. We have derived a set of necessary conditions for the FMGL solution to be block 
diagonal, and prove that they are also sufficient in the case of three graphs, extending the recent work 
in [T3]. A screening rule has been developed to enable the efficient estimation of large multiple graphs. 
Numerical experiments on synthetic and real data demonstrate the effectiveness of the proposed method 
and screening rule. Based on our extensive simulation studies, we conjecture that the proposed necessary 
conditions are also sufficient for the general case with more than 3. A future direction is to prove the 
necessary and sufficient conditions in the general case. 
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Appendix 

A. Optimization 

We solve the problem in Eq.© by the blockwise coordinate descent method. Consider the partition of @' fc ) 



e (h) =lo& k = l,...,K (16) 



2 

"21 u 22 



where is (p — 1) X (p — 1), is (p— 1) X 1, and is scalar. We solve and #22' > k = 1,...,K &t each 
iteration while fixing the rest. Then Eq.([2} is equivalent to the following problem: 

mm f; (- Iog(0g> - (OS'ffeg))" 1 ^) + « + -(sgW?) + 2P(8 12 ). (17) 

8 12 .S22 — 7 v ' 
1=1 

where 

p(e 12 ) = \iJ2 11^ ||i + a 2 £ \\e[f - eg+^Hx, 

k=l k=l 

012 = {e£\. . .,d[V), and 022 = {9&\. . .,0%P}. 
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Optimizing Eq, (117[) with respect to #22 yields 

1 

s 



if = iv+(ei?) T (®[?r 1 oi?, k = i,...,K. (is) 



°22 

5(fe) 



Plugging 0£> into Eq.([T7]), we can get the following optimization problem 



K 



6 12 = argmin^ (^^(©iiVM?) + M^fe^) + 2P(0 12 ). (19) 
i— 1 

The objective in Eq. (|19p consists of a smooth convex term and a nonsmooth convex penalty term. Many 
algorithms can be applied to solve Eq. (|19l) . In this paper, we employ the accelerated gradient method [17] and 
the fused lasso signal approximator [55] • From , it is easy to obtain 9^ based on Eq.|T5]). 

A.l Accelerated gradient method 

Denote Y = [0$, • • • , ©if 5 ] S TZ (p - 1)xK , and U = [s$, . . . , s^j ] £ ^ (p_1)xK , then Eq. JTSJl can be equivalent^ 
written as 

mm h(Y) := / (Y) + P(Y) (20) 

where /(Y) = (5^2 (^is ^©l^V 1 ^ ) + (s^V !^) • Note that /(Y) is quadratic, whose gradient is 

Lipschitz continuous with constant L, i.e. 

||V/(Yi) - V/(Y 2 )||f < L||Yi - Y 2 ||f,VY 1 , Y 2 G TZ (p ~ 1)xK 

where || • \\p denotes the Frobenius norm. 

We use the Nesterov's method [T7] to solve Eq. (|20|l . The Nesterov's method is based on two sequences {Y^} 
and {V;} in which {Y^} is the sequence of approximate solutions, {V^} is the sequence of search points, and i 
represents the i-th iteration. The search point Vj is the affine combination of Yi_i and Yj as 

V i = Y i + /3i(Y i -Y i _i), 

where /3j is a properly chosen coefficient. The approximate solution Yj+i is computed as the minimizer of the 
linearized function of h(Y) at V»: 

Y 1+1 = argminfti,«,v t (Z) := /(V,) + <Z - V,, V/(V;)) + P(Z) + §||Z - Vif F 

where Li is determined by line search so that Li should be appropriate for the search point V». By ignoring the 
terms that do not depend on Z, the above equation can be expressed equivalently as 

Y l+1 = argmm ^i||Z - (V, - ±-V f(Vi))\\% + P(Z). (21) 

Eq. (|2ip is well decoupled, and each row of Yi+i can be separately computed by the fused lasso signal approxi- 
mator [26] [27], i.e. 

K-l 



■' - argmini||z - t*|| 2 + y--]|z||i + ^ V \z k - z k +i\ 
* 2 Li Li f— ' 

K=l 



where y J is the j-th row of Yi+i, and t 3 is the j-th row of Vj — l/LjV/(Vi). 

The key steps of the accelerated gradient method to solve Eq. (|19p are summarized in Algorithm [T] 



Algorithm 1: Accelerated Gradient Method 

Input: U, Ai,A 2 ,Y ,io 
Output: Y, L 

Initialization: Yj = Yq, a_i = 0, ao — 1, and L = Lq; 
while Not Converged do 

Set A = Vj = Yi + ^(Y, - Yi-x). 

Find the smallest L = L~i_x,2Li-x, ■ ■ ■ such that h(Y i+ i) < hL,Vi(Yi+i) where Y,+i is 
computed using Eq. (p2"T|) . 
Set Li — L and a i+1 = l+vSlZ. 
end 

return Y i+ i,Lf, 
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A.2 Computing (0^ } ) 1 

Eq. (fT9)l involves the inverse of 0^' that can be efficiently computed. Let the covariance matrix W l ' = 
(© (fe) )~\ k = 1,...,K. Consider the same partition of W (fe) 



T (k) _ (W[ 1 > w 12 



CO w (fe) 

11 W 12 

[k) (k 

21 w, 22 



W (K) _ 11 "12 u — 1 X 

VV — I ( fc ) j fe \ | , K — J., . . . , A, 

Wo "" 



we have 



w« = (eW)- 1 + .W(eg))- 1 flg'(flWf(eg)r 1 1 

wW = -«W(eW)- 1 o2 ) > «'S ) =-2 ) (22) 

since 4a' = V(^m ~ (0i?) T (©u ) -1 0i2 )■ Thus > we can obtain (®n) _1 b y wW 

(eg'j-^wS'-wWtwg'f/^, (23) 

which only needs 0(p 2 ) operations. 

The outline of FMGL is shown in Algorithm [2) 



Algorithm 2: Fused Multiple Graphical Lasso (FMGL) 

Input: SW,fc = l,...,K,Xi,X 2 
Output: &W,WW,k = l,...,K 

Initialization: W< fe ) = diag(S( fc )) and 0<» = (WW)" 1 ; 
for j = l,2,...,p, l,2,...,p,... 

Compute (©n * = • • • > K according to Eg. ([25)1 . 

Solve Eq. (|19|) using results from the previous round as warm-start. Update 8^ and 9^ using 
Eq.®. 

Update 0< fc ' and using Eq.® so that 0( fc )w( fe ) = I pxp . 

Until Convergence; 
return & (k \W (k \k = l,...,K; 



A. 3 Computational complexity 

For each iteration, we compute (9jj?) -1 and update W^ fc ', which involves rank-one operations with total cost 
of 0{Kp 2 ) for K graphs. Each iteration of the accelerated gradient method involves computation of the gradient 
and the computation of p — 1 fused lasso signal approximators. The complexity of computing the gradient is 
0(Kp 2 ). The fused lasso signal approximator usually takes less than 10 iterations to converge [26] . The number 
of iterations in the accelerated gradient method to obtain an e solution is 0(l/y / e). Thus, the total complexity 
of each iteration is 0(Kp 2 / '\fe), and the complexity of each round (update all p columns/rows) of Algorithm [2] is 
0{Kp A /^l). 

B. Proof of Lemma [2] 

We split the problem into three cases. 
• Case 1: -3Ai < yi + y% + y 3 < -Ai 
Eq.© can be rewritten as 

-Ai - 2/i < ai < 2Ai + y-2 + 2/3, -2Ai - yi - 2/2 < a 2 < Xi + yz. (24) 
Eq.([7|) can be written as 

— 2Ai — yi — 7/3 < ai — a 2 < Ai + y 2 - (25) 

Based on Eq. ([24l . we have 

-2Ai - 2/1 - 2/3 < ai - «2 < 4Ai +yi + 2y 2 + ya. (26) 

From Eg . (|24h . we can see that the region of Eq. (|24ll is a square since 2Ai + y 2 + 2/3 + (Ai + yi) = 
Ai +2/3 + (2Ai +yi +y 2 )- Since 4Ai +yi + 2y 2 +y 3 > Xi+y 2 and -2Ai-2/i-2/3 = -2Ai - 2/1 - 2/3, Eqs. (J2S)) 
and (|26p intersect. Moreover, the region of Ea. ()25p belongs to that of Eq. (|26|) . and the width of the region 
of Eq. (|25p is half of that of Eq . (]26[) . which means that the diagonal BC belongs to the region of Eq. (|26[) . 
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• Case 2: — Ai < yi + yi + yz < Ai 
Eq.© can be rewritten as 

— Ai - yi < ai < Ai — yi, — Ai + y 3 < a 2 < \i + 2/3- (27) 

Eq.([7J) can be written as 

-Ai + y2 < ai - «2 < Ai + j/2. (28) 

Based on Eq. (|27|l . we have 

-2Ai - - 2/3 < ai - «2 < 2Ai - 2/1 - y3- (29) 
Similar to Case 1, we can obtain the same conclusion as in Case 1. 

• Case 3: Ai < y\ + 2/2 + 2/3 < 3Ai 
Eq.((6j) can be rewritten as 

— 2Ai + 2/2 + 2/3 < ai < Ai — 2/1, — Ai + 2/3 < a,2 < 2Ai — 2/1 — 2/2- (30) 

Eq.© can be written as 

— Ai + 2/2 < ai — 02 < 2Ai — 2/1 — j/3- (31) 

Based on Eg. (|30[) . we have 

-4Ai + 2/1 + 22/2 + 2/3 < a\ ~ a 2 < 2Ai - 2/1 - 2/3. (32) 
Similar to Case 1, we can obtain the same conclusion as in Case 1. 

C. Graph of ADHD-200 

Figure [8] shows a subgraph of ADHD-200 identified by FMGL with the screening rule. 




Figure 8: A subgraph of ADHD-200 identified by FMGL with the proposed screening rule. The grey 
edges are common edges among the three graphs; the red, green, and blue edges are the specific edges 
for TDC, ADHD-I, and ADHD-C respectively. 



14 



